---
title: "Plotting UFO Sightings data"
output: html_notebook
---

### What is it? 
This data can be downloaded from: https://www.kaggle.com/NUFORC/ufo-sightings
The dataset comes in two files, one per version: scrubbed and complete. The complete version also includes entries where the location of the sighting was not found or is blank (0.8146%) or where the time is erroneous or blank (8.0237%). Since the reports date back to the 20th century, some of the older data might be obscured. The data contains the city, state, time, description, and duration of each sighting.

### What are we going to do?
We are going to plot a few charts to get a better picture of the data.

In this post, however, we shall look at each variable on its own, independently of the others.

### Housekeeping
  1. Load the libraries <b> ggplot2 </b>, <b> plyr </b> and <b> rworldmap </b> in your R environment.
     If you have not installed a library yet, you will need to install it first.
  2. Download the data and save it to your machine.
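
The install step can be sketched as follows. This is only one way to do it, assuming the three packages named above are installed from CRAN; `required`, `is_installed` and `missing_pkgs` are names made up for this example.

```r
# Packages used in this post
required <- c("ggplot2", "plyr", "rworldmap")

# requireNamespace() returns TRUE when a package is available
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

# Install only what is missing, and only in an interactive session
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```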


```{r include=FALSE}
library(ggplot2)
library(plyr)
library(rworldmap)

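# Adjust the path to wherever you saved the file. stringsAsFactors = TRUE keeps
# the pre-4.0 R default, which the factor checks further down rely on.
ufo_data <- read.csv('C:\\Users\\Umair Khan\\Desktop\\Datascience Portfolio\\ufo-sightings\\scrubbed.csv', stringsAsFactors = TRUE)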

```

Now let's look at the data and see what we have. For simplicity, let's display only the first three rows.

```{r echo=TRUE}
head(ufo_data, n = 3)
```
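
Besides `head()`, it can help to check the type of each column. Below is a minimal sketch on a small made-up data frame: `toy_sightings` and its values are invented for illustration and are not rows from the real file.

```r
# A tiny stand-in for the real data (values are invented)
toy_sightings <- data.frame(
  city = c("city a", "city b", "city c"),
  shape = c("cylinder", "circle", "light"),
  duration_seconds = c(2700, 20, 900),
  stringsAsFactors = FALSE
)

# str() prints one line per column with its type and first values
str(toy_sightings)

# class of every column, as a named character vector
col_classes <- sapply(toy_sightings, class)
print(col_classes)
```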

### Types of Variables

#### 1. Quantitative variable
  
  Variables that can be counted or measured. They fall into two categories: discrete variables are those you can count, while continuous variables are those you can measure.
  
#### 2. Qualitative variable
  
  Variables that describe categories, usually stored as text. Nominal variables have categories with no natural order, such as gender, while ordinal variables have categories that can be ranked, such as basic / extreme or low / medium / high.

Knowing the type of a variable is important because it tells us which graphs are appropriate for visualizing it.
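
As a rough illustration, here is how these four kinds of variable might be represented in R. All vectors below are toy values made up for this sketch, not columns of the UFO data.

```r
# Quantitative
sightings_per_day <- c(3L, 0L, 7L)        # discrete: counted (integer)
duration_seconds  <- c(12.5, 300, 45.2)   # continuous: measured (double)

# Qualitative
toy_shape <- factor(c("disk", "light", "disk"))            # nominal: no order
toy_intensity <- factor(c("low", "high", "medium"),
                        levels = c("low", "medium", "high"),
                        ordered = TRUE)                     # ordinal: ranked

is.ordered(toy_shape)                 # FALSE
is.ordered(toy_intensity)             # TRUE
toy_intensity[1] < toy_intensity[2]   # TRUE, because "low" ranks below "high"
```

Comparisons such as `<` are only meaningful for the ordered factor; for a nominal factor R returns `NA` with a warning.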

We shall be looking at the <b> country </b>, <b> shape </b>, <b> duration </b>, <b> longitude </b> and <b> latitude </b> columns.
In the next post, we shall look at the date and time variables.

Before we move forward, we need to see how our data is stored in the data frame.


```{r}
print(is.factor(ufo_data$country))
print(is.factor(ufo_data$shape))
print(is.factor(ufo_data$duration))
print(is.factor(ufo_data$longitude))
print(is.factor(ufo_data$latitude))
```

We can see that "country", "shape" and "latitude" are factors, so let's convert them into plain character and numeric vectors.


```{r}
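ufo_data$shape = as.character(ufo_data$shape)
ufo_data$country = as.character(ufo_data$country)
# Go through as.character() first: as.numeric() on a factor returns the
# internal level codes, not the latitude values. Unparseable entries become NA.
ufo_data$latitude = as.numeric(as.character(ufo_data$latitude))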
```
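
One caveat when turning a factor into numbers: `as.numeric()` applied directly to a factor returns the internal level codes rather than the values shown on screen. A small sketch with made-up latitude strings (`toy_lat` is a hypothetical name):

```r
# A factor holding numbers as text, as read.csv() may produce
toy_lat <- factor(c("29.88", "33.2", "29.88"))

# Direct conversion yields the level codes
codes <- as.numeric(toy_lat)
print(codes)    # 1 2 1

# Converting to character first yields the actual values
values <- as.numeric(as.character(toy_lat))
print(values)   # 29.88 33.20 29.88
```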

### Looking at the Longitude / Latitude columns

Let's plot the longitude and latitude to see where our UFOs are being sighted.

```{r echo=TRUE}
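map <- getMap()
# Draw the base map first, then overlay the sightings as points.
# Base graphics calls are issued one after another, not chained with `+`.
plot(map)
points(ufo_data$longitude, ufo_data$latitude, pch=19, col="blue", cex=.3)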
```

This plot shows where UFO sightings have been recorded. We shall see more about plotting spatial data in another post.
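
Points with missing or out-of-range coordinates cannot be placed on the map, so it may help to filter them out before plotting. A minimal sketch on made-up coordinates follows; `toy_coords`, `valid` and `clean_coords` are illustrative names, not part of the code above.

```r
# Made-up coordinates: row 2 is out of range, row 4 has a missing longitude
toy_coords <- data.frame(
  longitude = c(-97.94, 200.00, -1.47, NA),
  latitude  = c(29.88, 10.00, 53.38, 21.40)
)

# Keep rows where both values are present and within valid bounds
valid <- !is.na(toy_coords$longitude) & !is.na(toy_coords$latitude) &
  abs(toy_coords$longitude) <= 180 & abs(toy_coords$latitude) <= 90

clean_coords <- toy_coords[valid, ]
print(nrow(clean_coords))   # 2
```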

### Analyzing the shape column

First, let's see which unique shapes we have. The list contains an empty item, so we replace it with "Not specified".

```{r}
unique(x = ufo_data$shape)
```

```{r}
ufo_data$shape[ufo_data$shape == ""] <- 'Not specified'
shape_data <- data.frame( count(df = ufo_data, vars = "shape"))
ggplot(data = shape_data, aes(x=shape, y=freq)) + 
  ggtitle("Looking at Shape") +
  xlab("Shape") + 
  ylab("Count") +
  geom_bar(fill="steelblue",stat="identity" ) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

```

So through this plot, we can see that <b> unknown </b> and <b> Not specified </b> together add up to a significant number. If we were performing any regression analysis, we would surely club these together into a single category. Similarly, <b> crescent </b>, <b> pyramid </b> and <b> round </b> could be dealt with in the same way.
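
Clubbing categories together could look like the sketch below. It uses a short made-up vector (`toy_shapes`) rather than the real column, and folding both labels into "unknown" is just one possible choice.

```r
# A handful of made-up shape values
toy_shapes <- c("light", "unknown", "Not specified", "disk", "unknown", "light")

# Fold "Not specified" into "unknown"
merged_shapes <- toy_shapes
merged_shapes[merged_shapes %in% c("unknown", "Not specified")] <- "unknown"

# Count each category, largest first
shape_counts <- sort(table(merged_shapes), decreasing = TRUE)
print(shape_counts)
```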


### Analyzing the country column

Let's look at which countries we have in our dataset.

```{r}
unique(ufo_data$country)

```

Hmmm... these codes are not very readable. Since we don't want to use short forms for countries, let's map them to the full country names they most likely stand for.

```{r}
ufo_data$extendedCountry[ufo_data$country == "us"] <- 'USA'
ufo_data$extendedCountry[ufo_data$country == "gb"] <- 'United Kingdom'
ufo_data$extendedCountry[ufo_data$country == "ca"] <- 'Canada'
ufo_data$extendedCountry[ufo_data$country == "au"] <- 'Australia'
ufo_data$extendedCountry[ufo_data$country == "de"] <- 'Germany'
ufo_data$extendedCountry[ufo_data$country == ""] <- 'Unknown'
```
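
The same recoding can also be written as a single lookup with a named vector. This is an alternative sketch, not the approach used above; `country_names` and `toy_codes` are names made up for the example.

```r
# Lookup table: code -> full name
country_names <- c(us = "USA", gb = "United Kingdom", ca = "Canada",
                   au = "Australia", de = "Germany")

# A few sample codes, including a blank one
toy_codes <- c("us", "", "gb", "de")

# Indexing by name returns NA for anything not in the table (including "")
extended <- unname(country_names[toy_codes])
extended[is.na(extended)] <- "Unknown"
print(extended)
```

Adding a country then means adding one entry to the lookup table instead of another assignment line.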

Once done, let's look at the counts for each country.

```{r}
country_count = count(df = ufo_data, vars = "extendedCountry")
print(country_count)

```

No idea why UFOs are sighted most in the USA... Note, however, that many sightings have an unknown location. The number is high, so no statistical analysis can be performed without either ignoring or otherwise catering for these locations.
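
To judge how large the unknown group is, the counts can be turned into percentages. A minimal sketch with base R's `table()` and `prop.table()` on a made-up vector (`toy_country` is not the real column):

```r
# Made-up country labels
toy_country <- c("USA", "USA", "Unknown", "Canada", "USA", "Unknown")

# Absolute counts per country
toy_counts <- table(toy_country)

# Share of each country in percent, rounded to one decimal
toy_share <- round(100 * prop.table(toy_counts), 1)
print(toy_share)
```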

Since we have spatial data, we can display it very effectively on a map. Although there are many other ways to visualize this information, such as pie charts or donut charts, a map is probably the most intuitive.

```{r}
pie_chart_data = data.frame(group = country_count$extendedCountry, value = country_count$freq)
matched <- joinCountryData2Map(pie_chart_data, joinCode="NAME", nameJoinColumn="group")
mapCountryData(matched, nameColumnToPlot="value", mapTitle="UFO Sightings in the World", catMethod = "quantiles", colourPalette = "topo", oceanCol="aliceblue")

```

Now suppose we wanted to zoom the map into a particular region. We would use the following code:

```{r}
mapCountryData(matched, nameColumnToPlot="value", mapTitle="UFO Sightings in the World", catMethod = "quantiles", colourPalette = "topo", oceanCol="aliceblue", mapRegion = "Australia")
```


### Conclusion

In this post, we saw how to inspect the data before analysis and modify it to suit our needs. We also saw how to plot graphs for a single variable and perform simple aggregations.

